Abstract

We estimate the effect of class size on student performance in 18 
countries, combining school fixed effects and instrumental variables to 
identify random class-size variation between two adjacent grades within 
individual schools. Conventional estimates of class-size effects are 
shown to be severely biased by the non-random placement of students 
between and within schools. Smaller classes exhibit beneficial effects 
only in countries with relatively low teacher salaries. While we find 
sizable beneficial effects of smaller classes in Greece and Iceland, the 
possibility of even small effects is rejected in Japan and Singapore. In 11 
countries, we rule out large class-size effects. 

on Education Policy and Governance (PEPG) at the Kennedy School of Government, Harvard 
University, and the National Bureau of Economic Research (NBER) for their hospitality during his stay in 
Cambridge, Mass., during which the main work on this paper was completed. 



I. Introduction 



School systems around the world differ in many respects. Important sources of variation include 
examination systems, the existence of high-stakes incentives for students and teachers, the 
provision of remedial instruction for lagging students or of enrichment classes for outstanding 
students, the level and allocation of resources, the quality of the teaching force, and average class 
size. Given these differences, it is not obvious that findings from any particular school system 
translate directly into general principles for all systems. Although the effect of class size on 
student achievement in the United States has recently been the subject of a great deal of research, 
the U.S. findings simply may not generalize to school systems in other parts of the world with 
distinctive institutional configurations. This paper explores this possibility by providing 
estimates of class-size effects in 18 education systems scattered across four continents. 

The central problem in estimating class-size effects is that placement decisions made by 
parents and schools obscure the causal relationship between class size and student performance. 
For example, parents may place children in schools with bigger or smaller class sizes on the basis 
of their educational performance; administrative rules may track students into different schools 
depending on their achievement; and individual educators may sort students within a school into 
differently sized classes according to their behavior or demonstrated academic potential. As a 
result, naive estimates of education production functions may be biased both by endogeneity of 
class size with respect to student performance and by omitted variables. Estimating “true” class- 
size effects, i.e. the causal effect of class size on student performance, thus requires an 
identification strategy that restricts the analysis to exogenous variations in class size, thereby 
allowing for the causal class-size effect to be disentangled from the effects of sorting. 

In principle, two such strategies are available. The first is to conduct an experiment, using 
random assignment of students to classrooms to ensure that all variation in class size is 
exogenous. The second is to adopt a quasi-experimental approach in which instrumental variable 
(IV) estimates are used to restrict the analysis to that part of the total variation in class size that is 
exogenous to student achievement. 

Evidence from the one large-scale random-assignment experiment on class-size effects, the 
Tennessee Student/Teacher Achievement Ratio experiment (“Project STAR”), has been analyzed 
both in terms of its initial impact on student achievement (Krueger 1999) and in its longer-term 
consequences for academic progress (Krueger and Whitmore 2001). Unfortunately, however, the 
validity of this experiment may actually have been undermined by specific decentralized 
placement decisions; non-random parental choices prior to the start of the experiment - e.g. not 
to send their children to participating schools if they were assigned to larger classes - cannot be 
ruled out and would bias any estimate of class-size effects. Several other issues of design and 
implementation of Project STAR also call into question the validity of its results (Hanushek 
1999). Furthermore, any experiment suffers from the so-called “Hawthorne effect” in that 
participants are aware that they are being evaluated, and may respond by increasing their effort. 
Knowledge of one’s participation in an experiment can also alter the prevailing incentive 
conditions in important ways. For example, the schools participating in Project STAR may have 
realized that their future resource endowments would be affected by the outcome of the 
experiment, and may have adjusted their behavior accordingly (Hoxby 2000). In short, the use of 
randomized experiments to assess the effects of class size has intrinsic problems, and the 
implementation of the one major class-size experiment seems to have been less than optimal. 
And it must be recalled that we have evidence from only one experiment, conducted in a single 
U.S. state in the mid-eighties. The near universal popularity of country music notwithstanding, 
the situation in Tennessee simply may not be representative of school systems in other parts of 
the world. 

Studies using quasi-experimental evidence also have important disadvantages. Principal 
among them is the need to examine rather specific variations in class size that make it possible to 
disentangle the causal effect of class size on student achievement from the results of sorting. As a 
consequence, studies using this kind of identification strategy are also only available for a few 
countries and situations. Angrist and Lavy (1999) exploit a specific rule on maximum class size 
in Israel to extract presumably exogenous variation in Israeli class sizes. While this identification 
strategy excludes class-size variations due to student assignments within a school, it is not 
immune to bias from parental residential choice. Moreover, they are only able to analyze the 
effects of variation in class size between 20 and 40 students, which may not be the range most of 
interest to policy-makers in many countries. Case and Deaton (1999) identify class-size effects by 
looking at data on black students in South Africa during apartheid, arguing that the variation in 
class sizes for black students was largely exogenous, because the black population at this time 
had neither freedom of residential choice nor control over their schools’ endowments. But the 
South African school system during apartheid was obviously unique in its institutional 
configuration, and was characterized by district-average class sizes of up to 80 students. It is 
therefore unclear whether the results are relevant to more advanced countries. Hoxby (2000) 
exploits variation over time in student enrollments due to random fluctuations in the timing of 
births and district rules regarding maximum or minimum class sizes to identify exogenous 
variation in class sizes, applying this approach to elementary schools in the U.S. state of 
Connecticut. Unfortunately, her identification strategies require a long panel of rich data and 
have yet to be applied in other contexts. 

In this paper, we use the international database of the Third International Mathematics and 
Science Study (TIMSS) and develop a new identification strategy that provides unbiased 
estimates of the effects of class size on student achievement in a host of school systems from all 
over the world. The TIMSS database provides data on representative samples of students in the 
two adjacent grades with the highest share of thirteen-year-old students from about 40 countries, 
18 of which have data rich enough to support the implementation of our identification strategy. 
Our identification strategy is designed to take advantage of two unique characteristics of this 
database. Specifically, it exploits the fact that the TIMSS database contains information on the 
performance and class size of students in two adjacent grades of each school taking the same 
achievement test, as well as on the average class size in each grade of each school. 

In a nutshell, our identification strategy uses the part of the between-grade difference in class 
size in a school that reflects differences in the school’s average class size between the two grades 
to predict that part of the between-grade difference in student performance that is idiosyncratic to 
the school. In doing so, we exclude both between-school and within-school sources of student 
sorting. Between-school sources of student sorting are eliminated by controlling for school fixed 
effects, while within-school sorting is filtered out by instrumenting actual class sizes by the 
average class size in the relevant grade at each respective school. The remaining variation in 
class size between classes at different grades of a school is random, and presumably reflects 
natural fluctuations in student enrollment. We can use this random variation to identify the causal 
impact of class size on student performance. 

The paper is organized as follows. Section II details our identification strategy, while Section 
III illustrates the basic intuition behind this strategy with two examples. Section IV introduces 
our data. In Section V, we present our estimates of causal class-size effects and compare them to 
naive estimates of class-size effects. We also discuss the precision and magnitude of our 
estimates in greater detail, comparing them to previous estimates from the United States. Section 
VI concludes with some observations about the relationship between the institutional 
characteristics of school systems and the existence of class-size effects. 



II. The Identification Strategy 



A. The Standard Method and Potential Sorting Biases 

The standard method to estimate the relationship between class size and student performance is a 
least-squares (LS) regression of test scores on class size, controlling for a set of family- 
background characteristics. Virtually all of the estimates of education production functions 
surveyed in Hanushek (1986, 1996) and Krueger (2000) use this method. Assuming that we use 
test-score data from different grades, the following education production function would be 
estimated: 



(1)    $T_{icgs} = \alpha_1 S_c + Ctrl_{icgs}\,\beta + \gamma G_g + \nu_c + \varepsilon_{icgs},$



where $T_{icgs}$ is the test score of student $i$ in class $c$ at grade level $g$ in school $s$, $S$ is the class size, 
$Ctrl$ is a vector of controls for student- and family-background characteristics, and $G$ is the grade 
level. The coefficients $\alpha_1$, $\beta$, and $\gamma$ are parameters to be estimated, $\nu$ is a class-specific 
component of the error term, and $\varepsilon$ is a student-specific component of the error term. The 
following subscripts are applied throughout: i is for student, c is for class, g is for grade level, 
and s is for school. 

While this identification method has been commonly used in the literature, it is clearly naive 
to interpret the estimated parameter $\alpha_1$ as a causal effect of class size on student performance. 
The difficulty is that the variation in class sizes S is not necessarily exogenous to the variation in 
test scores T. There are any number of plausible ways in which class size may be influenced by 
student performance. Parents of high-performing students may choose to live in residential 
districts with small class sizes to better foster their abilities. On the other hand, it might also be 
the case that parents of poorly performing students may choose schools with small class sizes 
because they feel that their children need extra attention. Schools may set up smaller remedial 
classes for laggards, or they may establish special enrichment classes for their most talented 
students. Likewise, the school system as a whole may track students of different performance 
levels into different kinds of schools with different average class sizes. In short, all kinds of 
placement mechanisms are at work in every school system, and a priori it is not even clear 
whether their overall effect is to place the worse- or the better-performing students in smaller 
classes. 

In effect, every decision by parents, schools, or administrative entities that places students of 
different performance levels into classes of different size introduces sorting effects. These sorting 
effects influence the naively estimated relationship between class size and student performance, 
so that the coefficient estimate $\alpha_1$ is a mixture of the “true” class-size effect (the causal impact of 
class size on student performance) and of the consequences of sorting. The diversity and 
decentralized character of these decisions makes it impossible to control for the effect of sorting 
by including additional variables in the regression. Some kind of omitted variable bias would 
inevitably remain, and it may be fallacious to assume it to be of second-order magnitude. Instead, 
we need a strategy to identify causal effects of class sizes on student performance that bases its 
estimation on exclusively exogenous variation in class size. 
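
The direction of this bias can be illustrated with a small simulation. The sketch below uses invented data, not TIMSS: the assumed true effect of -1 point per additional student, the sorting rule (weaker students are placed in smaller classes), and all parameter values are hypothetical choices made only for illustration.

```python
import random

def ols_slope(x, y):
    """Slope from a one-regressor least-squares fit of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

TRUE_EFFECT = -1.0  # assumed causal effect of one extra student (hypothetical)

rng = random.Random(0)  # fixed seed so the run is reproducible
sizes, scores = [], []
for _ in range(2000):
    ability = rng.gauss(0, 1)                  # unobserved by the analyst
    size = 25 + 5 * ability + rng.gauss(0, 2)  # sorting: weaker classes are smaller
    score = 500 + 30 * ability + TRUE_EFFECT * size + rng.gauss(0, 5)
    sizes.append(size)
    scores.append(score)

# Least squares of score on class size, ignoring the sorting mechanism.
naive_slope = ols_slope(sizes, scores)
print(f"true effect: {TRUE_EFFECT:+.2f}, naive LS slope: {naive_slope:+.2f}")
```

Under these assumptions the naive slope comes out positive although the true effect is negative, because class size proxies for the unobserved ability term. Reversing the sign of the sorting rule would bias the estimate in the other direction.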

B. School Fixed Effects to Account for Between-School Sorting 

We can usefully divide the different kinds of sorting into two broad categories: sorting taking 
place between schools, such as residential choice or tracking by schools, and sorting taking place 
within schools, such as parents pressuring their children to be placed into particular classes or 
heads of schools assigning students to different classes. The development of the identification 
strategy used in this paper proceeds through two stages, each of which eliminates one of these 
two categories of sorting effects. 

The strategy used to eliminate the effects of between-school sorting is to control for school 
fixed effects (SFE). Any systematic between-school variation, stemming from any source 
whatsoever, is thereby removed when estimating the class-size effect. This strategy is 
implemented simply by including a dummy variable for each school: 

(2)    $T_{icgs} = \alpha_2 S_c + Ctrl_{icgs}\,\beta + \gamma G_g + D_s\,\delta + \nu_c + \varepsilon_{icgs},$

where D is a vector of school dummies. Obviously, this identification strategy requires that our 
dataset contain information on more than one class from each school. 
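
A minimal sketch of how school fixed effects remove between-school sorting follows. It uses six invented classes in three hypothetical schools, omits the grade dummy and the control variables for brevity, and replaces the school dummies with within-school demeaning, which yields the same class-size coefficient in a linear regression.

```python
from collections import defaultdict

def ols_slope(x, y):
    """Slope from a one-regressor least-squares fit of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def demean_within(groups, values):
    """Subtract each group's mean from its members (the 'within' transformation)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for g, v in zip(groups, values):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for g, v in zip(groups, values)]

# (school, class size, class-average score). Hypothetical data built as
# score = school_base - 2 * class_size, with stronger schools running larger classes.
classes = [
    ("s0", 20, 460), ("s0", 22, 456),   # school_base = 500
    ("s1", 30, 500), ("s1", 33, 494),   # school_base = 560
    ("s2", 40, 540), ("s2", 38, 544),   # school_base = 620
]
school = [c[0] for c in classes]
size = [c[1] for c in classes]
score = [c[2] for c in classes]

pooled_slope = ols_slope(size, score)  # mixes sorting and causal effect
sfe_slope = ols_slope(demean_within(school, size), demean_within(school, score))
print(f"pooled: {pooled_slope:+.2f}, school fixed effects: {sfe_slope:+.2f}")
```

In this constructed example the pooled slope is positive, while the within-school slope recovers the assumed effect of -2, since all variation in the school-level base score is removed by the demeaning.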

C. Instrumental Variables to Account for Within-School Sorting 

Even having controlled for school fixed effects, however, the estimates produced by equation (2) 
might still be biased by sorting taking place within schools wherever there is more than one class 
per grade in a school. We therefore apply an instrumental variables (IV) strategy to ensure that 
only an exogenous part of the class-size variation is used to estimate the causal class-size effect. 
To be used as an instrument, a variable should be highly correlated with the endogenous 
explanatory variable (class size), but causally unrelated with the dependent variable (student 
performance). That is, the instrument should have no effect on the dependent variable apart from 
its indirect effect through the endogenous explanatory variable, and it should not be endogenous 
to the dependent variable. 

The variable we use to instrument for the actual class size is the average class size at the 
respective grade level of the school. 1 It is expected - and it is shown below - that schools’ 
average class size in each particular grade is highly correlated with the actual class size 
experienced by their students in that grade. 2 There is no reason to expect that the average class 
size would affect the performance of students in a specific class except through its effect on 
the actual size of the class of the students. Furthermore, we do not see how student performance 
should have an impact on the grade-average class size, once any school fixed effect is accounted 
for. Given this instrument, the second stage of the two-stage least-squares (2SLS) estimation is 
then: 



1 The average grade-level class size was first applied as an instrument for actual class size in Akerhielm 
(1995). However, as Akerhielm did not control for school fixed effects, her estimates may still be biased by between- 
school sorting effects. Furthermore, Akerhielm also used the overall grade-level enrollment of a school as a second 
instrument in addition to average class size. However, this may be a false instrument as there might be a direct 
relationship between overall enrollment and student performance that is unrelated to differences in class size (cf. 
Angrist and Lavy 1999). Moreover, none of the coefficients on enrollment in Akerhielm’s first-stage regressions are 
significant, suggesting that it is not a good instrument. 

2 When there is only one class at a grade level in a particular school, actual and grade-average class size will 
be equal and the problem of within-school sorting does not exist. 

(3)    $T_{icgs} = \alpha_3 \hat{S}_c + Ctrl_{icgs}\,\beta + \gamma G_g + D_s\,\delta + \nu_c + \varepsilon_{icgs},$

where $\hat{S}_c$ is the predicted value of the first-stage regression of actual class size $S_c$ on the average 
class size of the grade level in the school $A_g$ and the other exogenous variables: 

(4)    $S_c = \phi A_g + Ctrl_{icgs}\,\pi + \mu G_g + D_s\,\lambda + \eta_c + \xi_{icgs}.$

The average difference in performance between students from the adjacent grades is 
controlled for by the grade-level dummy G, so that the remaining performance difference 
between the classes from the different grades is idiosyncratic to each school. This idiosyncratic 
variation in student performance is then related to that part of the actual class-size difference 
between the two grades that is due to differences in average class size between the two grades. 
Arguably, this remaining class-size variation is caused by random fluctuations in cohort size 
between two adjacent grades of a school. The coefficient estimate $\alpha_3$ can thus be interpreted as a 
true estimate of the causal impact of class size on student performance which is unbiased by 
within-school and between-school sorting. 
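
Because there is one endogenous regressor and one instrument, the 2SLS estimate can equivalently be written as a ratio of two sample covariances. In the sketch below, $A$ stands for the grade-average class size, and the tildes are notation introduced here only for illustration: they denote residuals after regressing each variable on the controls, the grade dummy, and the school dummies.

```latex
\hat{\alpha}_3
  = \frac{\widehat{\mathrm{Cov}}\bigl(\tilde{A},\, \tilde{T}\bigr)}
         {\widehat{\mathrm{Cov}}\bigl(\tilde{A},\, \tilde{S}\bigr)}
  = \frac{\widehat{\mathrm{Cov}}(\tilde{A}, \tilde{T}) \,/\, \widehat{\mathrm{Var}}(\tilde{A})}
         {\widehat{\mathrm{Cov}}(\tilde{A}, \tilde{S}) \,/\, \widehat{\mathrm{Var}}(\tilde{A})}
```

The numerator of the second expression is the reduced-form coefficient of test scores on grade-average class size, and the denominator is the first-stage coefficient of actual class size on grade-average class size. A weak first stage (a denominator near zero) would therefore make the estimate unstable.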

Because equation (3) includes school fixed effects, and because every class size at a given 
grade level is instrumented by the same average class size, this IV strategy (SFE-IV) requires that 
we have comparable information on student performance from more than one grade level in each 
school. As the same achievement test can only sensibly be administered to different grade levels 
if the students’ performance levels are not too far apart, the grade levels should be adjacent. In 
short, our identification strategy requires a dataset with unusual characteristics. 3 

The class-size variation on which the estimate is based, namely within-school between- 
grade variation, certainly is a rather specific one. Any differences in class size within one grade 
and any differences in class size between schools are excluded from the analysis. However, as 
will be discussed below, this variation has the distinct advantage of being in the relevant range of 
variation for potential policy initiatives in each country. The variations in class size analyzed here 



3 Additionally, there should not be institutional differences in the rules determining class size between the two 
adjacent grades, which might introduce non-random differences in class sizes between the two grade levels. Even if 
there were such institutional differences, however, the inclusion of a grade dummy in all the equations ensures that 
the estimated class-size effects will be unbiased as long as the existence of the rule is unrelated to student 
performance. 
are generally of a magnitude that may be affordable given the budget constraints on class-size 
reduction, and they occur by design at the level most relevant for each country. 

III. Two Illustrative Examples 

Before actually implementing this identification strategy, we first present two graphical examples 
that illustrate visually the basic intuition behind our identification strategy. The specific examples 
we use - the mathematics performance of students in Singapore and Iceland - are chosen purely 
on the basis of their capacity to demonstrate the advantages of our identification strategy. While a 
more thorough discussion of the data is relegated to Section IV, it suffices here to point out that it 
comes from the Third International Mathematics and Science Study (TIMSS), which tested 
representative samples of seventh- and eighth-grade students in a host of countries. As a general 
rule, one seventh-grade class and one eighth-grade class were tested in each school. TIMSS 
mathematics test scores were scaled to an international mean of 500 and an international standard 
deviation of 100. For these illustrative examples only, we do not use student-level data, but rather 
the average test score in each classroom. Nor do we yet control for family-background 
characteristics. 

A. Class Size and Mathematics Performance in Singapore 

In Singapore, we have 268 classes in our sample - 134 schools with one seventh-grade class and 
one eighth-grade class each. With an average mathematics test score of 623, students in 
Singapore are the best performers of all countries participating in TIMSS. The average class size 
in Singapore is 33.2. Figure 1 plots the average test-score performance of students in class-size 
blocks of five students. Each block with five students more on average has a higher average level 
of performance than the previous block, indicating that students in larger classes perform better 
than students in smaller classes. 4 The same counterintuitive pattern is apparent in the top panel of 



4 This pattern of performance steadily increasing with class size in Singapore is driven mainly by 
performance differences within seventh grade. Within eighth grade, the only statistically significant difference in 
performance between the different blocks of class sizes is that classes with more than 39 students scored higher, on 
average, than classes with 35-39 students. Within seventh grade, all the performance differences between 
consecutive blocks reported in Figure 1 are statistically significant excepting 35-39 versus 40-45 and 10-14 versus 
15-19. 
Figure 2, which presents a scatter plot of class-average test scores versus class size. 5 Note that 
this positive correlation is not driven by outliers or non-linearities. Rather, the relationship 
between class size and student performance appears to be quite linear. Interpreting this 
correlation as causation would lead to the unexpected conclusion that larger classes facilitate 
student learning. As argued above, however, this relationship between performance and class size 
is likely to be spurious, reflecting the differential sorting of students between and within schools. 

Looking at differences-in-differences allows us to control for the effects of between-school 
sorting. That is, for each school, we measure both the difference in average student performance 
between seventh and eighth grade and the difference in class size between seventh and eighth 
grade. This procedure removes any difference in the overall performance levels between schools 
(school fixed effects), leaving only within-school variation in both test scores and class sizes. 
The middle panel of Figure 2 plots within-school differences in performance against within- 
school differences in class size. Once again we observe a statistically significant positive 
correlation between performance differences and class-size differences, although the size of the positive 
correlation is substantially reduced. This reduction suggests that on average in Singapore, poorly 
performing students seem to be sorted into schools with smaller classes. 

However, even the differences-in-differences picture might be distorted by various types of 
student sorting that occur within schools. The next step in our identification strategy accordingly 
attempts to eliminate any effects of within-school sorting by using only that part of the between- 
grade variation in actual class sizes that reflects variations in grade-average class sizes. We first 
regress the between-grade difference in actual class size on the between-grade difference in 
grade-average class size (that is, we instrument actual class size by grade-average class size), and 
then use the predicted between-grade difference in class size for each school from this regression 
as the measure of between-grade difference in class size on the horizontal axis of the bottom 
panel of Figure 2. This scatter plot reflects the basic idea of our identification strategy: It relates 
that part of the between-grade difference in class size in each school that reflects differences in 
the average class size of the two grades to the difference in student performance between the two 
grades in the school. Having eliminated the effects of student sorting both between and within 



5 For purposes of clarity, the trend line in the top panel of Figures 2 and 3 does not control for the grade level 
of each class. However, trend lines controlling for grade level would look just the same in both cases. 



schools, we interpret the bottom panel of Figure 2 as a picture of the causal effect of class size on 
student performance. The picture suggests that class size has no causal effect on student 
performance whatsoever in mathematics in Singapore. Rather, weaker students seem to be 
consistently placed in smaller classes, both between and within schools. 
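
The three panels can be mimicked numerically. The following sketch uses four invented schools, not the Singapore data. It assumes a true class-size effect of -2 points per student, an eighth-grade gain of 30 points, and a within-school sorting rule under which a tested class that is smaller than its grade average holds weaker students. All names and numbers are hypothetical.

```python
def cov(x, y):
    """Sum of cross-products of deviations from the mean (unnormalized covariance)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

TRUE_EFFECT = -2.0  # assumed causal effect per additional student
GRADE_GAIN = 30.0   # assumed average lead of eighth over seventh grade
SORTING = 4.0       # score points per student of deviation from the grade average

# (school base score, avg size g7, tested size g7, avg size g8, tested size g8)
schools = [
    (500, 24, 24, 20, 19),
    (520, 24, 24, 20, 21),
    (480, 24, 24, 28, 27),
    (510, 24, 24, 28, 29),
]

def class_score(base, gain, size, avg_size):
    # Classes below the grade-average size are weaker by construction (sorting).
    return base + gain + TRUE_EFFECT * size + SORTING * (size - avg_size)

d_score, d_size, d_avg = [], [], []
for base, a7, s7, a8, s8 in schools:
    t7 = class_score(base, 0.0, s7, a7)
    t8 = class_score(base, GRADE_GAIN, s8, a8)
    d_score.append(t8 - t7)  # between-grade difference in performance
    d_size.append(s8 - s7)   # between-grade difference in actual class size
    d_avg.append(a8 - a7)    # between-grade difference in grade-average class size

# Middle panel: differences-in-differences on actual class sizes.
sfe_slope = cov(d_size, d_score) / cov(d_size, d_size)
# Bottom panel: actual differences instrumented by grade-average differences.
sfe_iv_slope = cov(d_avg, d_score) / cov(d_avg, d_size)
print(f"SFE: {sfe_slope:+.3f}, SFE-IV: {sfe_iv_slope:+.3f}")
```

Differencing removes the school base score, and demeaning the differences absorbs the common grade gain. In this constructed example the differenced estimate remains biased toward zero by within-school sorting, while the instrumented estimate recovers the assumed effect, because the sorting deviations were chosen to be uncorrelated with the differences in grade-average class size.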

B. Class Size and Mathematics Performance in Iceland 

The second country we use to illustrate our identification strategy is Iceland. The mathematics 
sample in Iceland consists of 131 classes in 65 schools (there was one school where two seventh- 
grade classes were tested). The average TIMSS test score in mathematics in Iceland was 467, and 
the average class size 20.3. Figure 3 depicts the same three scatter plots for Iceland that were 
depicted in Figure 2 for Singapore. 

The top panel of Figure 3 shows that class size and mathematics performance in Iceland are 
uncorrelated. Note that there are some extremely small classes in Iceland; however, these do not 
reflect unusually small schools, which were excluded from the TIMSS sample. Using 
differences-in-differences to exclude between-school differences in performance levels in the 
middle panel again reveals no obvious relationship between class size and performance. The lack 
of a substantial change in the slope of the trend lines between the first two panels of the figure 
suggests that in Iceland, unlike in Singapore, students of lower ability are not systematically 
sorted into schools with smaller classes. The bottom panel of Figure 3 again provides the picture 
most representative of our identification strategy, which excludes any sorting effects. This final 
picture reveals a negative relationship between class sizes and student performance - smaller 
classes seem to cause better mathematics performance in Iceland. 6 

Although the simple correlation between class size and student performance in Iceland 
initially suggests that there is no relationship between the two, this lack of correlation cannot be 
taken at face value. Our identification strategy reveals that smaller classes do in fact enhance 
students’ learning in mathematics in Iceland. In this simple class-level correlation without control 
variables, the negative coefficient on class-size differences is statistically significant at the 10 
percent level. The class-size coefficient is slightly larger than 2 (in absolute terms), implying that 



6 The result stays virtually unchanged when the two outlying observations at the right-hand side of the graph 
are dropped. Additionally dropping the outlying observation at the bottom of the graph, the coefficient on class size 
grows (in absolute terms) to -3.01 and is statistically significant at the 5 percent level. 
a class size smaller by one student elevates student performance by 2 TIMSS test-score points. 
That is, a class that is 5 students (or a quarter of the average class size in Iceland) smaller than 
another one would have performed, on average, slightly more than 10 test-score points (or 10 
percent of an international standard deviation in TIMSS test scores) better as a result of the class- 
size effect. 
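
Written out, the calculation is as follows, taking the coefficient as approximately -2 (the estimate is slightly larger in absolute terms), the Icelandic average class size as 20.3, and the international standard deviation as 100:

```latex
\Delta T \approx \hat{\alpha}\,\Delta S = (-2) \times (-5) = 10 \text{ points},
\qquad
\frac{10}{100} = 0.10 \text{ international standard deviations},
\qquad
\frac{5}{20.3} \approx 0.25 .
```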

C. Examples of Individual Schools in Iceland 

The basic features of our identification strategy, and of the class-size/performance link in Iceland, 
can be further illuminated by looking at several cases of individual schools. The three schools, A 
to C, that we discuss here are all real schools taken from the TIMSS data for Iceland. In school 
A, the sampled seventh-grade class has 21 students, and the sampled eighth-grade class has 25 
students. The same is true for the average class sizes in seventh and eighth grade in this school, 
suggesting that the school may only have one class in each of these two grades. The seventh- 
grade class is thus smaller, both on average and actually, in school A. The average performance 
of the seventh-grade students sampled in school A is 462, and in eighth grade it is 473. That is, 
the tested eighth-graders in school A performed 11 test-score points better than the seventh- 
graders tested in school A. On average across all schools in Iceland, however, eighth-graders 
performed 31 points better than seventh-graders. This means that the smaller class size in seventh 
grade in school A might have led to a lag in performance relative to eighth-graders that is smaller 
than the lag usually observed. It is informative to note here that a between-school evaluation in 
this case would have led to the opposite, counterintuitive result. The average test score of 
seventh-graders in Iceland is 450, and the average class size (when averaged over classes, not 
students) is about 14.5 in both seventh and eighth grade. Although the size of the seventh-grade 
class in school A is significantly above average, its performance is also above average. However, 
this between-school variation might be contaminated by various forms of sorting. 

In school B, the tested seventh-grade class has 26 students, and the tested eighth-grade class 
has 19. The grade-average class sizes in school B are 25 in seventh grade and 17 in eighth grade. 
That is, the tested eighth-grade class is smaller than the tested seventh-grade class, and this difference seems 
to be caused by a smaller student cohort in eighth grade in school B. The seventh-graders scored 
429 points on average, the eighth-graders 494. The lead of the eighth-graders is thus 65 test-score 



points, which is substantially larger than the country-average lead of 31 points, and we would 
attribute the relatively better performance of the eighth-graders to their smaller class size. 

In school C, the seventh-grade class actually tested was larger by 3 students than the tested 
eighth-grade class (24 versus 21 students). The lag in performance, however, was only 13 test- 
score points (as compared to the country-average lag of 31 points). As such, this would seem 
counterintuitive. However, the average class size in seventh grade in school C was 23, while it 
was 24 in eighth grade. That is, the tested eighth-grade class was smaller by 3 students than the 
average eighth-grade class. It might be suspected that the tested eighth-grade class is one where 
poorer-performing students had been sorted into a smaller-than-average class, perhaps in an 
effort to provide them with extra attention. Therefore, the relatively small lead of tested eighth- 
graders in school C might have nothing to do with a causal class-size effect, but might be due to 
within-school sorting. 
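
The school-level comparisons above reduce to simple arithmetic: the eighth-grade lead in a school is set against the country-average lead of 31 points. The following is a minimal sketch of that calculation; the function names are illustrative and not part of the original analysis, and the sketch deliberately ignores the sorting concerns that the instrument is meant to address.

```python
COUNTRY_AVG_LEAD = 31  # country-average lead of eighth- over seventh-graders, in test-score points


def lead(score_7th, score_8th):
    """Eighth-grade lead over seventh grade within one school."""
    return score_8th - score_7th


def excess_lead(school_lead, country_lead=COUNTRY_AVG_LEAD):
    """School lead minus the country-average lead (positive: eighth grade does unusually well)."""
    return school_lead - country_lead


# School B: seventh-graders scored 429, eighth-graders 494.
school_b_lead = lead(429, 494)
school_b_excess = excess_lead(school_b_lead)

# School C: the text reports only the lead of 13 points.
school_c_excess = excess_lead(13)

print(school_b_lead, school_b_excess, school_c_excess)
```

In school B the eighth grade is both smaller and further ahead than usual, whereas in school C the tested eighth-grade class is smaller yet falls short of the usual lead, which is the pattern the text attributes to possible within-school sorting.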

These illustrative examples at the country and the school level confirm that it can be highly 
misleading to take naive estimates of class-size effects for causal effects. However, by applying 
an identification strategy that accounts for sorting effects, causal class-size effects can be 
distilled. The preliminary analyses presented here suggest that there does not seem to be a causal 
class-size effect on mathematics performance in Singapore, but that smaller classes do lead to 
superior mathematics performance in Iceland. The difference between the two results reinforces 
the importance of assessing the impact of class-size resources independently for different school 
systems. 

IV. Data and Descriptive Statistics 

A. Some Background on the TIMSS Database 

As indicated in Section II, the proposed identification strategy is rather demanding in its data 
requirements. Specifically, it requires a dataset with two features: first, performance, class-size, 
and student-background data from more than one grade level in each school taking the same 
achievement test; and second, additional information on the average grade-level class size for 
each grade in each school. The data collected in the Third International Mathematics and Science 
Study (TIMSS) for a host of countries is the only large-scale dataset we are aware of that meets 
these stringent requirements. 7 

TIMSS, conducted in 1994/95 by the International Association for the Evaluation of 
Educational Achievement (IEA), was the largest and most encompassing international study of 
student performance ever conducted, with more than 40 countries initially participating. Each of 
these countries administered the test to a nationally representative sample of middle school 
students, defined as those students enrolled in the two adjacent grades that contained the largest 
proportion of 13-year-old students at the time of testing (grades seven and eight in most 
countries). All countries endorsed the curriculum framework which was set up to ensure that the 
test content was appropriate for the students in both grades and reflected their current curriculum. 
Students were tested in a wide array of content dimensions in mathematics and science, using 
both free-response and multiple-choice items. In addition, extensive background information was 
gathered through student, teacher, and school-principal questionnaires. In the end, datasets for the 
middle school years were made available for 39 school systems around the world. 

Student performance in mathematics and science was measured separately using the scale of 
international achievement scores, which have an international mean of 500 and an international 
standard deviation of 100. Data on the actual class size of each mathematics and science class is 
available in the background questionnaires completed by each teacher. Data on the school-level 
average class size in grades seven and eight are available from the school-principal background 
questionnaires. Finally, family background data is contained in the student background 
questionnaires. We use the international TIMSS database constructed by Wößmann (2000), 
which merged performance data and data from the different background questionnaires for each 
individual student. This database also includes imputed data for missing values of the variables 
contained in the background questionnaires. Complete performance data is available for all 
participating students. 

7 Note that not even the other recent international student achievement tests allow for an implementation of 
our identification strategy. In the repeat study of TIMSS conducted in 1999, data was collected for students from 
only one grade (eighth, but not seventh), making the between-grade assessment within each school which is 
necessary to implement our identification strategy impossible. In the Programme for International Student 
Assessment (PISA), conducted by the OECD in 2000, the target population was that of 15-year-old students, so the 
sampling frame did not provide for a clear sampling of two classes in two grades per school. Furthermore, the PISA 
school questionnaire does not provide data on grade-average class size, which would be necessary to implement our 
identification strategy. 



Each country was meant to collect data for a sample of at least 150 schools. While a few 
countries did not reach this target, others like Canada sampled as many as 429 schools. 
Generally, one class per grade was selected at random within each sampled school, and all of its 
students tested. 8 Some countries tested more than one class per grade. Schools in geographically 
remote regions, extremely small schools, and schools for students with special needs were 
excluded from the target population. Within sampled schools, disabled students who were unable 
to follow even the test instructions were excluded; students who merely exhibited poor academic 
performance or discipline problems were required to participate (Foy et al. 1996; see also Martin and 
Kelly 1998: Appendix B). The overall exclusion rate was not to exceed 10 percent of the total 
student population. 

To be able to implement our identification strategy, we were forced to restrict the sample to 
those schools in which both a seventh-grade and an eighth-grade class were actually tested. 
Furthermore, for a school to be included, both data on the actual class size and data on the grade- 
average class size had to be available for both the seventh-grade and the eighth-grade class. This 
second criterion ensures that our class-size estimates are based on non-imputed values for our 
variables of interest: actual class size, instrument, and student performance. We ultimately 
conducted our analysis on the 18 of the 39 countries for which data for at least 50 schools in both 
mathematics and science remained after applying these criteria. Appendix 1 details the specific 
reasons for the exclusion of each of the other TIMSS participants. 
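
The two school-level inclusion criteria can be expressed as a simple filter. The sketch below is an illustration under assumed field names (`school`, `grade`, `actual_size`, `grade_avg_size`), which are hypothetical and not taken from the TIMSS database; missing values are represented by `None`.

```python
def eligible_schools(classes):
    """Return ids of schools with a tested class in both grades seven and eight
    and non-missing actual and grade-average class size for each of them."""
    complete_grades = {}
    for c in classes:
        has_both_sizes = c["actual_size"] is not None and c["grade_avg_size"] is not None
        if has_both_sizes:
            complete_grades.setdefault(c["school"], set()).add(c["grade"])
    return sorted(s for s, grades in complete_grades.items() if {7, 8} <= grades)


# Made-up records, for illustration only.
classes = [
    {"school": 1, "grade": 7, "actual_size": 24, "grade_avg_size": 23.0},
    {"school": 1, "grade": 8, "actual_size": 21, "grade_avg_size": 24.0},
    {"school": 2, "grade": 8, "actual_size": 19, "grade_avg_size": 17.0},  # no seventh-grade class tested
    {"school": 3, "grade": 7, "actual_size": 26, "grade_avg_size": None},  # grade-average size missing
    {"school": 3, "grade": 8, "actual_size": 25, "grade_avg_size": 25.0},
]

kept = eligible_schools(classes)
print(kept)
```

A country would then enter the analysis only if at least 50 such schools remain in both mathematics and science.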

B. Descriptive Statistics 

The number of students, classes, and schools per country in our mathematics and science samples 
are presented in the first three columns of Tables 1 and 2. In mathematics, the number of schools 
ranges from 55 in Hong Kong to 168 in Canada; in science, it ranges from 50 in Hong Kong to 
148 in Japan. The smallest number of students is in Iceland (1,448 in science), the largest in 
Japan (10,142 in mathematics). Tables 1 and 2 also present descriptive statistics of the dataset. 
Portugal exhibits the lowest average test scores (439 in mathematics and 453 in science), while 
Singapore achieves the highest (623 and 577). We use the following variables to control for 
student and family background: the student’s sex, age, and country of birth, data on whether the 
student is living with both parents, and parental education and the number of books in the 
student’s home (both categorical variables with five categories). 

8 Deviations from this general rule for the sampling of schools and students are documented in Martin and 
Kelly (1998: Appendix B). 

Appendix Tables A1 and A2 compare the sample of students included in our study to the full 
sample of students tested by TIMSS. The highest share of excluded students is found in Iceland in our 
mathematics sample (55 percent) and in Canada in our science sample (75 percent). At the 
opposite extreme, less than 2 percent of the tested students in either mathematics or science were 
excluded in Japan. The difference in the average performance between the included and the full 
sample of students is quite small in all the countries, except for science performance in Iceland, 
where the difference is 9 test-score points. There are also almost no substantial differences in the 
student- and family-background data for the included and the full samples of students. The 
largest differences by far are that the share of female students included in the French school 
system of Belgium is 4.2 percentage points larger than the original share in mathematics (6.7 
percentage points in science), and that the share of parents who finished university in Iceland is 
5.9 percentage points smaller in our mathematics sample (5.2 percentage points in science). In 
the science sample, the share of parents with a university degree is also smaller in Canada (6.1 
percentage points), while the share of parents with some education after secondary school is 
larger in Romania (6.1 percentage points). Apart from these relatively minor exceptions, 
however, the sample of students that we include in our study is very similar to the full sample of 
students tested in TIMSS, making us confident that the exclusion of students is unrelated to our 
variables of interest and thus does not introduce bias to our estimation. 

Tables 3 and 4 present descriptive statistics on class size. The smallest average class size of 
20.3 students per class is found in Iceland, closely followed by the two Belgian school systems 
(column (1)). With an average of 56.9 students per class in mathematics and 48.8 in science, 
Korea has the largest classes by far. The other East Asian countries also feature relatively large 
classes of more than 30 students. The country averages of the grade-average class size in a school 
(column (2)) are generally quite similar to actual class sizes, except for the fact that Korea’s 
grade-average class size is only 50.5 students in mathematics. The amount of within-country 
variation in grade-average class sizes is somewhat smaller than the variance in actual class sizes. 
This is of course what we would expect, as outlying cases of extremely small and large tested 
classes are balanced out by other classes within the same grade. 

Column (3) of Tables 3 and 4 reports the class-size difference between the seventh- and 
eighth-grade classes actually tested in each school. On average, there are no sizable differences in 
class size between seventh and eighth grade. The only exceptions are Korea and Singapore, 
where on average over all schools, the eighth-grade classes have between 4.2 and 6.9 students 
more than seventh-grade classes. In Korea, these differences vanish once we look at the 
difference in the grade-average class size (column (4)). Thus, there do not seem to be 
institutional differences within countries in the rules governing class size between seventh and 
eighth grade, with the exception of Singapore. Even there, any effect of this rule on our estimates 
of class-size effects should be controlled for by the inclusion of a grade dummy in the estimation, 
as long as the rule is unrelated to student performance. 

As outlined above, our estimation strategy focuses on the difference in class size between 
seventh and eighth grade within each school. The standard deviations reported in parentheses in 
the first four columns of Tables 3 and 4 demonstrate that the variation in the grade difference in 
class size is by and large comparable to the variation in actual class sizes in every country. That 
is, our estimates of class-size effects on student performance draw from a range of class-size 
variations comparable to the actual variation in each country. 

The standard deviation in the between-grade difference in average class size ranges from 1.1 
in Hong Kong to over 6 in Spain and Singapore, with an average over the 18 countries in our 
sample of 3.5, or 13 percent of the average actual class size. In other words, our estimates of 
class-size effects draw on a range of variation that encompasses the range of feasible policy 
initiatives in most countries. Columns (5) and (6) of Tables 3 and 4 show the minimum and 
maximum of the difference in the average class size between seventh and eighth grade in a 
school for each country, providing further information on the range of variation in class sizes we 
are able to use. 

Exceptions with low variation in class size are Hong Kong and Scotland, where there is not 
much variation left once between-school variations as well as within-grade variations in a school 
are accounted for. The standard deviation of the between-grade difference in average class size is 
less than 2 in these two countries, while it is larger than 2 in all other countries. The largest 
positive class-size difference between eighth- and seventh-grade classes in a school is only 2 in 
Hong Kong, and the largest negative difference between eighth- and seventh-grade classes is only 
3. That is, there seems to be basically no between-grade variation in average class size within 
individual schools in Hong Kong and Scotland, leaving little variation in class size on which to 
base our estimation. 

In columns (7) and (9) of Tables 3 and 4, coefficient estimates of a simple regression of actual 
class size on grade-average class size are reported for each country. The regression reported in 
column (7) has no constant. As is evident, the estimates are very close to 1 in all countries. 
Column (8) reports the p-values of a Wald test of the hypothesis that these estimates are 
equal to 1. Even though these coefficients are very precisely 
estimated, they are statistically indistinguishable from 1 in most countries. This shows that the 
data on actual class size, collected from teachers, are consistent with the data on grade-average 
class size, collected from school principals; data from the different background questionnaires 
therefore seem compatible. Furthermore, these estimates confirm that the sampled classes are of 
the same size as the average class sizes of the grades of the sampled schools. Column (9) reports 
coefficient estimates of the same regression of actual class size on grade-average class size, this 
time with a constant included in the regression. These estimates are all smaller than 1 (with the 
exception of the Canadian science sample, where the estimate is very imprecise). This confirms 
that grade-average class sizes are larger than actual class sizes when actual class sizes are small, 
and smaller than actual class sizes when actual class sizes are large. Thus, the classes actually 
tested in TIMSS do indeed feature unusually small and large classes, which might reflect 
decisions to sort students of different ability levels into especially small or large classes. This 
reinforces the importance of our IV strategy, which enables us to use only that part of the 
variation in actual class sizes that is due to variations in grade-average class sizes. 
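
The contrast between the regressions with and without a constant can be illustrated with a small sketch. The data below are made up for illustration and do not come from TIMSS; the function names are likewise hypothetical. Without a constant the slope is close to 1, while with a constant it falls below 1 because the smallest and largest tested classes deviate from their grade averages.

```python
def slope_no_constant(x, y):
    """OLS slope of y on x in a regression through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def slope_with_constant(x, y):
    """OLS slope of y on x in a regression that includes a constant."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Illustrative values: grade-average class size (x) and actual class size (y).
grade_avg = [20, 22, 24, 26, 28]
actual = [21, 22, 24, 26, 27]

b_no_const = slope_no_constant(grade_avg, actual)
b_const = slope_with_constant(grade_avg, actual)
print(round(b_no_const, 4), round(b_const, 4))
```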

V. Estimation Results 

Estimates of class-size effects based on the different methods advanced in Section II for the 18 
countries in our sample are presented in Tables 5 to 8. The dependent variable in the results 
reported in Tables 5 and 7 is the TIMSS mathematics score, while in Tables 6 and 8 it is the 
TIMSS science score. To facilitate comparisons of the estimates across countries we use the non- 
standardized TIMSS test scores, which have an international mean of 500 and an international 
standard deviation of 100. All reported results control for grade level as well as for the complete 
set of student- and family-background variables discussed in Section IV. All regressions are 
performed at the level of the individual student, which allows for a perfect matching of the 
student- and family-background controls to the performance of each student. 

In each of our estimations, attention was given to the complex data structure produced by the 
survey design and the multi-level nature of the explanatory variables. To achieve nationally 
representative student samples, TIMSS used stratified sampling within each country, which 
produced varying sampling probabilities for different students (Martin and Kelly 1998). Thus, all 
estimations are weighted by students’ sampling weights in order to obtain nationally 
representative coefficient estimates from the stratified survey data. This ensures that the 
contribution of the students from each stratum in the sample to the parameter estimates is the 
same as would have been obtained in a complete census enumeration (DuMouchel and Duncan 
1983). 
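
To see why the sampling weights matter, consider a minimal sketch of a weighted least-squares regression with a single regressor. This is an illustration with made-up data, not the estimation code used for the results; the function names are hypothetical. Each student's contribution is scaled by the sampling weight, so that undersampled strata count in proportion to their population share.

```python
def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def wls(x, y, w):
    """Weighted least squares of y on x with a constant; returns (slope, intercept)."""
    mx = weighted_mean(x, w)
    my = weighted_mean(y, w)
    num = sum(wi * (xi - mx) * (yi - my) for xi, yi, wi in zip(x, y, w))
    den = sum(wi * (xi - mx) ** 2 for xi, wi in zip(x, w))
    slope = num / den
    return slope, my - slope * mx


# Made-up data: class size, test score, and sampling weight for four students.
class_size = [20, 20, 30, 30]
score = [500, 500, 480, 520]
weight = [1, 1, 3, 1]  # the third student represents three times as many students

weighted_slope, weighted_intercept = wls(class_size, score, weight)
unweighted_slope, _ = wls(class_size, score, [1, 1, 1, 1])
print(weighted_slope, unweighted_slope)
```

In this toy example the unweighted slope is zero while the weighted slope is negative, which shows how ignoring the sampling design can change the estimate.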

Furthermore, the explanatory variable of interest in our study, class size, is measured at a 
different level than the dependent variable, student performance. As shown by Moulton (1986), 
such a hierarchical structure of the data requires the addition of a higher-level error component to 
avoid spurious results. Thus, the error terms in equations (1) to (4) have a class-specific error 
component $v_c$ in addition to the conventional student-specific error component $\varepsilon_{cgs}$. The 
clustering-robust linear regression (CRLR) method delivers consistent estimates of standard 
errors in the presence of hierarchically structured data (cf. Deaton 1997). CRLR relaxes the usual 
assumption of independence of all observations and requires only that the observations be 
independent across classes, allowing any amount of correlation within classes. It thus lets the 
data determine the structure of the error components in these equations. 
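
The idea behind clustering-robust standard errors can be sketched for the simplest case of one regressor and a constant. The code below is a simplified illustration, assuming no sampling weights, no further controls, and no small-sample correction; it is not the CRLR implementation used for the reported results, and the data are made up. Residuals are summed within each class before being squared, so any correlation among students of the same class is reflected in the variance.

```python
import math


def cluster_robust_slope(x, y, cluster):
    """OLS slope of y on x (with constant) and its cluster-robust standard error."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx

    # Sum (x - mean) * residual within each cluster, then square the cluster sums.
    scores = {}
    for xi, yi, ci in zip(x, y, cluster):
        resid = yi - intercept - slope * xi
        scores[ci] = scores.get(ci, 0.0) + (xi - mx) * resid
    variance = sum(s ** 2 for s in scores.values()) / sxx ** 2
    return slope, math.sqrt(variance)


# Made-up data: three classes with two students each; class size is constant within a class.
size = [20, 20, 25, 25, 30, 30]
score = [500, 510, 490, 500, 480, 470]
klass = ["a", "a", "b", "b", "c", "c"]

b, se = cluster_robust_slope(size, score, klass)
print(round(b, 3), round(se, 4))
```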

A. Results of the WLS and SFE Methods 

Column (2) of Tables 5 and 6 reports the coefficient on class size $\alpha_1$ from a standard 
least-squares estimation as in equation (1). More than half of these weighted least-squares (WLS) 
estimates in mathematics, and nearly half the estimates in science, have a statistically significant 
positive sign; students in larger classes apparently performed significantly better than students in 
smaller classes. 9 In other words, the naive WLS estimation method leads to the counterintuitive 
result that students fare better in larger classes. Moreover, this result seems quite universal: It 
emerges in Western Europe (e.g., Belgium, France), in Eastern Europe (e.g., Czech Republic, 
Romania), in Australia, and in East Asia (e.g., Hong Kong, Japan). These results immediately 
suggest a problem with the WLS method. The only cases with statistically significant negative 
coefficients on class size on the basis of the WLS method are Korea in mathematics and Iceland 
and Scotland in science. 

9 These estimates confirm the results of Hanushek and Luque (2002), who estimate class-size coefficients for 
mathematics performance in TIMSS using ordinary least squares (OLS) and find statistically significant positive 
estimates in the majority of countries. Hanushek and Luque (2002) use only classroom-level rather than student-level 
data, and their controls for student background are inferior to the detailed data on individual students used in this 
paper as they do not use the student background questionnaire. Thus, although they can control for a few 
school-level indicators based on principals’ assessments, they lack such information as parental education or the number of 
books in an individual student’s home. 

Results of the estimation method that takes into account school fixed effects (SFE) as in 
equation (2) are presented in column (4) of Tables 5 and 6. These estimates of the coefficient $\alpha_2$ 
control for any between-school differences in student ability or educational quality. The number 
of countries with statistically significant positive coefficient estimates decreases to about half the 
number found with the WLS method. On the other hand, there is only one additional statistically 
significant negative estimate (in science). The increased prevalence of statistically insignificant 
results cannot be attributed to a lower degree of precision in our estimates. On average over the 
18 countries, the standard deviation of the estimates actually decreases slightly from 0.628 in 
mathematics (0.490 in science) with the WLS method to 0.619 (0.469) with the SFE method. 
There seems instead to be less evidence of any relationship between class size and student 
performance once between-school differences are eliminated. Still, there remain a large number 
of counterintuitive results, as 10 out of the total of 36 estimates exhibit a statistically significant 
positive sign. As discussed before, the $\alpha_2$ estimates may be contaminated by the effects of 
within-school sorting. 

B. First- and Second-Stage Results of the SFE-IV Method 

The final identification strategy presented in Section II was designed to eliminate any effect of 
between- and within-school sorting from our class-size estimates by combining school fixed 
effects with an instrumental variable approach (SFE-IV). The correlation between our 
instrument, the grade-specific average class size in the school, and the endogenous explanatory 
variable, actual class size, was already reported in columns (7) to (9) of Tables 3 and 4. It was 
shown that there is a strong and statistically highly significant correlation between actual class 
size and grade-average class size within all countries in both mathematics and science, with only 
3 exceptions. Once controlling for a constant, the coefficient on grade-average class size was 
statistically insignificant in Flemish Belgium and Korea in mathematics and in Scotland in 
science. However, the estimates reported in Tables 3 and 4 contained no further controls as 
additional right-hand-side variables. 

Column (1) of Tables 7 and 8 reports the coefficient p on grade-average class size of the first- 
stage regression of the 2SLS estimation of our SFE-IV method (equation (4) in Section II), where 
school fixed effects, grade level, and the whole set of student- and family-background variables 
are controlled for. Even after controlling for these factors, grade-average class size remains 
highly correlated with actual class size in nearly all cases. Exceptions with statistically 
insignificant estimates include the 3 cases mentioned above, the United States in mathematics, 
and Australia, Hong Kong, Korea, and the United States in science. 10 In these cases, the grade- 
average class size does not retain any useful information as an instrument for actual class size 
after controlling for school fixed effects, grade level, and background characteristics. That is, our 
instrument in these countries is quite poor, and our preferred identification strategy cannot be 
properly applied. It may be that in these countries, the relevant subject (mathematics or science) 
is taught in special classes, created for example by breaking down or rearranging regular classes. 
Such a policy would explain why classes in these subjects do not appear to be of the same size as 
typical classes in the relevant grade. 

10 The coefficient estimate in the United States in science actually has a negative sign and is statistically 
significant at the 10 percent level. 

The estimates of class-size effects $\alpha_3$ based on our SFE-IV method (equation (3) in Section II) 
are presented in column (5) of Tables 7 and 8. As explained in Section II, this method excludes 
any variation caused by between- and within-school sorting, so the coefficient $\alpha_3$ can be 
interpreted as an unbiased estimate of the causal effect of class size on student performance. The 
most notable feature of our SFE-IV results is the disappearance of the counterintuitive, 
statistically significant positive coefficients on class size in all but one case, namely Portugal in 
mathematics. We find a statistically significant negative coefficient on class size in France and 
Iceland in mathematics, as well as in Greece and Spain in science. In these four cases, smaller 
classes seem to produce superior student performance. In the vast majority of cases, however, the 
estimated coefficient is not statistically significantly different from zero. 
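
The logic of the SFE-IV estimator can be illustrated with a deliberately simplified sketch at the school level. It assumes one tested class per grade, no sampling weights, and no student-background controls, so it is not the student-level 2SLS estimation reported in the tables. With two grades per school, school fixed effects amount to taking the eighth-minus-seventh-grade difference, and the constant in the differenced equation plays the role of the grade dummy. All field names and data below are made up for illustration.

```python
def cov(a, b):
    """Sample covariance (divided by n) of two equally long lists."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


def sfe_iv(schools):
    """Return (first stage, reduced form, IV estimate) from between-grade differences."""
    d_score = [s["score8"] - s["score7"] for s in schools]
    d_size = [s["size8"] - s["size7"] for s in schools]  # actual class size
    d_avg = [s["avg8"] - s["avg7"] for s in schools]     # grade-average class size (instrument)
    first_stage = cov(d_avg, d_size) / cov(d_avg, d_avg)
    reduced_form = cov(d_avg, d_score) / cov(d_avg, d_avg)
    return first_stage, reduced_form, reduced_form / first_stage


# Made-up schools in which each additional student lowers the score by 2 points
# and eighth-graders lead seventh-graders by 31 points at equal class size.
schools = [
    {"size7": 28, "size8": 21, "avg7": 27, "avg8": 19, "score7": 500, "score8": 545},
    {"size7": 24, "size8": 21, "avg7": 23, "avg8": 24, "score7": 480, "score8": 517},
    {"size7": 20, "size8": 24, "avg7": 21, "avg8": 24, "score7": 450, "score8": 473},
    {"size7": 22, "size8": 22, "avg7": 22, "avg8": 23, "score7": 460, "score8": 491},
]

first_stage, reduced_form, iv_estimate = sfe_iv(schools)
print(round(first_stage, 3), round(reduced_form, 3), round(iv_estimate, 3))
```

The IV estimate is the ratio of the reduced-form coefficient to the first-stage coefficient, which is why a first stage close to zero, as in the weak-instrument cases noted above, makes the estimate unstable.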

In what follows, we discuss these results in greater detail. Section V.C compares the three 
identification methods in terms of the sign and significance level of the estimated class-size 
effects they produce. Section V.D comments on the precision of our SFE-IV estimates, while 
Section V.E gives a detailed assessment of their magnitude. In the end, it is the potential size of 
any class-size effect that decides whether a class-size reduction will be worth its costs. While 
many of our estimates are statistically indistinguishable from zero, they may offer meaningful 
conclusions if they allow us to reject the existence of sizable class-size effects. 

C. Comparison of the Three Methods 

A comparison of the estimates of class-size effects based on the three methods is revealing. 
Imagine, for example, that we were to conduct a meta-analysis of our estimates similar to the 
meta-analyses in the surveys of class-size estimates conducted by Hanushek (1986, 1996) and 
Krueger (2000). Figure 4 depicts the distribution of the total of 36 estimates - pooling 
mathematics and science results - into statistically significant positive, statistically insignificant 
positive, statistically insignificant negative, and statistically significant negative categories for 
each of the three methods. Taking the WLS estimates at face value, we would have to conclude 
that in more than half the school systems in our sample larger classes produce better student 
performance. Only in 6 of the 36 cases would a (statistically significant or insignificant) negative 
coefficient be detected - indicating that students learn more in smaller classes. With the SFE 
method, we would still find a statistically significant positive coefficient in more than a quarter 
of the cases. Among the statistically insignificant estimates, the relative number of negative signs 
increases. 
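
The four-way classification behind such a tally can be sketched as follows. The critical value of 1.96 (a two-sided 5 percent test) is an assumption made for illustration, since the significance threshold is not restated here, and the estimates in the example are made up.

```python
from collections import Counter


def classify(coef, se, critical=1.96):
    """Label an estimate by significance and sign; a zero coefficient is counted as negative."""
    significant = abs(coef / se) >= critical
    sign = "positive" if coef > 0 else "negative"
    return ("significant " if significant else "insignificant ") + sign


def tally(estimates):
    """Count (coefficient, standard error) pairs in each of the four categories."""
    return Counter(classify(c, s) for c, s in estimates)


# Made-up (coefficient, standard error) pairs.
example = [(2.0, 0.5), (0.3, 0.6), (-0.2, 0.6), (-2.5, 1.0)]
counts = tally(example)
print(counts)
```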

Using our SFE-IV identification method, we do not detect a statistically significant effect of 
class size on student achievement for most school systems in our sample. In four cases, however, 
we observe that smaller classes have led to a superior level of student performance. Only in one 
case do we obtain a counterintuitive statistically significant positive effect. 11 The statistically 
insignificant estimates are rather evenly split between positive and negative results, with a slight 
majority negative. 

11 This pattern of results contrasts with Hanushek and Luque’s (2002) conclusion, also based on TIMSS data, 
that sorting effects do not heavily influence estimates of class-size effects. Their assessment relies primarily on the 
use of weak proxies in an attempt to restrict their analysis to schools with only one class per grade, and it does not 
address the possibility of student sorting at the between-school level. 

D. Precision of the SFE-IV Estimates 

The question arises whether the increasing prevalence of statistically insignificant estimates of 
the class-size coefficient with the SFE-IV method relative to the other methods reflects a genuine 
lack of a causal impact of class size on student performance, or whether it is just due to a lack of 
precision of the SFE-IV method. In several cases, the standard error of the estimate of $\alpha_3$ is 
extremely large. This is the case for five countries in mathematics and for three countries in 
science. These countries are Australia (standard error of 3.9 in mathematics and 9.5 in science), 
Hong Kong (7.2 and 12.8), and Scotland (6.3 and 51.9) in both subjects, plus Flemish Belgium 
(6.7) and the United States (69.6) in mathematics. 

The lack of precision in these cases seems to be a direct consequence of the rather demanding 
data requirements of our identification strategy, as we can account for them in the following 
ways. It is obvious that the quality of the instrument as depicted by its statistical significance in 
the first-stage estimation is directly reflected in the precision of the estimates of the second-stage 
estimation. Flemish Belgium and the United States in mathematics, as well as Australia, Hong 
Kong, and Scotland in science, were all cases with statistically insignificant estimates in the first 
stage. This leaves the cases of Australia, Hong Kong, and Scotland in mathematics. 

For Hong Kong and Scotland, we saw that there was basically no variation in the average 
class size between the two grades in a school (Section IV). The largest between-grade difference 
in average class size, positive or negative, observed in mathematics in any school in Hong Kong 
is only 3, and it is only 5 in Scotland (columns (5) and (6) of Table 3). That is, in these two 
countries there is simply not much of the within-school variation in grade-average class size on 
which our estimation strategy relies. Similarly, in Australia, Scotland, and the United States 
approximately 50 percent of the sampled schools exhibit no difference in average class size 
between the two grades, and in all three countries this is true both in mathematics and in science. 

The reduced-form association between student performance and grade-average class size, 
reported in column (3) of Tables 7 and 8, confirms that the extremely imprecisely estimated 
outliers in the estimates of class-size effects are indeed consequences of weak instruments in 
these cases. In the reduced-form results, the extreme values vanish both among the coefficient 
estimates and among their standard errors. This underscores the weakness of the instrument in 
these cases; if there were any causal class-size effect in these cases, the instrument would be too 
weak to detect it. 

Thus, the five cases in mathematics and three cases in science with extremely imprecise 
estimates of $\alpha_3$ can be attributed to data insufficient to implement the demanding SFE-IV 
identification strategy. Excluding these cases, however, the standard errors of our SFE-IV 
estimates are only about half a test-score point larger than the standard 
errors of the estimates produced by the less demanding WLS and SFE methods. Excluding the 
five countries with standard errors larger than 3.9 in mathematics (Australia, Flemish Belgium, 
Hong Kong, Scotland, and United States), the average standard error of the remaining 13 
countries is 1.022 with the SFE-IV method, compared to 0.583 with the WLS method and 0.594 
with the SFE method. Similarly, excluding only the three countries with standard errors larger 
than 9 in science (Australia, Hong Kong, and Scotland) leaves an average standard error among 
the other 15 countries of 1.151 with the SFE-IV method, compared to 0.440 with the WLS 
method and 0.450 with the SFE method. 

A standard error of approximately 1 corresponds to an effect of 1 test-score point per student by 
which class size is reduced. For a reduction in class size by 5 students, this amounts to an 
increase in student performance of 5 test-score points, or only 5 percent of the international 
standard deviation in TIMSS test scores. In other words, a class-size reduction of 5 students that 
produced an increase in test scores of only 10 points (two standard errors), or 10 percent of a 
standard deviation, would be statistically significant at the 5 percent level with our 
SFE-IV method. Apart from the 8 out of 36 cases with extremely large standard errors, therefore, 
the estimates produced with the SFE-IV method seem precise enough to pick up any sizable 
class-size effect. 
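
The arithmetic behind this precision argument can be written out as a short sketch. The critical value of 2 is a rounded stand-in for the usual 1.96, and the function names are illustrative.

```python
INTERNATIONAL_SD = 100.0  # international standard deviation of TIMSS scores


def smallest_detectable_gain(se_per_student, reduction, critical=2.0):
    """Smallest test-score gain from a class-size reduction that would be
    statistically significant, given the standard error per student."""
    return critical * se_per_student * reduction


def in_standard_deviations(points, sd=INTERNATIONAL_SD):
    """Express a test-score difference as a share of the international standard deviation."""
    return points / sd


gain = smallest_detectable_gain(se_per_student=1.0, reduction=5)
share = in_standard_deviations(gain)
print(gain, share)
```

With a standard error of about 1 per student, a reduction of 5 students needs to raise scores by roughly 10 points, or 10 percent of a standard deviation, to be detected.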

E. Magnitude of the Class-Size Effect 

Given the precision of the SFE-IV estimates in the remaining 28 cases, we can now assess 
whether there are any sizable class-size effects in educational production in these cases. As most 
of the previous studies that build on exogenous variations in class size by using an experimental 
or quasi-experimental design have been implemented for the United States, it seems sensible to 
compare the magnitude of our estimates of class-size effects in different countries to the previous 
estimates from the United States. The problem with this comparison is that the magnitude of the existing 
estimates of causal class-size effects varies widely even within the United States. On the one 
hand, Krueger (1999) finds in his analysis of Project STAR in Tennessee a quite substantial 
increase in student performance due to the experimental reduction in class size. On the other 
hand, Hoxby (2000) provides quasi-experimental evidence from Connecticut that rules out the 
existence of even very modest causal effects of class size on student performance. 12 

As not even the studies on the United States come to conclusive results, we chose to assess the 
magnitude of our estimated effects for other school systems by comparing them to those 
produced by Krueger (1999), which lie at the upper bound of estimates produced so far. Krueger 
presents a very rough cost-benefit analysis based on these estimates suggesting that the economic 
benefits in terms of increased future earnings due to improved test scores caused by reducing 
class size fall in the same ballpark as the costs. At least in the United States, then, the benefits of 
smaller classes would have to be of roughly this same magnitude in order for class-size 
reductions to be cost effective. Krueger (1999: 530) found that the students in classes that were 7 
to 8 students smaller on average than regular-sized classes performed about 0.22 standard 
deviations of a test score better. This means that students performed about 3 percent of a standard 
deviation better for every 1 student less in the class. In terms of the international TIMSS test 
score, this is equivalent to 3 test-score points. 
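
The conversion from Krueger's estimate to the benchmark of 3 test-score points can be reproduced with a few lines of arithmetic. The figures (0.22 standard deviations, 7 to 8 students, a standard deviation of 100) are those quoted above; the variable and function names are illustrative only.

```python
# Convert Krueger's (1999) estimate into TIMSS test-score points per student.
KRUEGER_EFFECT_SD = 0.22   # gain in standard deviations for the smaller classes
CLASS_SIZE_GAPS = (7, 8)   # smaller classes had 7 to 8 fewer students
TIMSS_SD = 100.0           # standard deviation of the international test score


def points_per_student(effect_sd, gap, sd=TIMSS_SD):
    """Test-score points gained per one-student reduction in class size."""
    return effect_sd / gap * sd


high = points_per_student(KRUEGER_EFFECT_SD, CLASS_SIZE_GAPS[0])
low = points_per_student(KRUEGER_EFFECT_SD, CLASS_SIZE_GAPS[1])
print(f"between {low:.2f} and {high:.2f} points per student")
```

Both bounds round to roughly 3 points per student, which is the benchmark value used in the tests that follow.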

None of our statistically significant point estimates of class-size effects, presented again in 
column (1) of Tables 9 and 10, is as large as 3 (in absolute terms). However, in three of the four 
cases in which we find a statistically significant negative coefficient on class size, the value of 
this coefficient is larger in absolute terms than 2.4. These are France and Iceland in mathematics 
and Greece in science. That is, in three out of the 28 reasonably precisely estimated cases we do 
find point estimates that are not too distant from the order of magnitude presented by Krueger. 

As most of our class-size estimates are statistically insignificantly different from zero, we next 
consider whether we can reject with reasonable confidence an effect of the magnitude of 
Krueger’s estimates. Columns (3) and (4) of Tables 9 and 10 present results of Wald tests that 



12 Angrist and Lavy’s (1999) estimates for Israel lie somewhere in between these two extremes. 
test whether our estimated coefficients are statistically significantly different from -3. 13 For eight 
countries in mathematics, and also for eight countries in science, the tests reject a class-size 
effect of that order of magnitude at the 1 percent significance level. In another three cases, such an 
effect is rejected at the 5 percent significance level, and in another two cases at the 10 percent 
level. Thus, in 16 to 21 (depending on the degree of confidence) of the 28 rather precisely 
estimated class-size effects, we can reject a class-size effect of the order of magnitude of 
Krueger’s (1999) estimates. This is not to say that we can reject any class-size effect of any order 
of magnitude whatsoever in these cases. It only shows that we can be rather confident that the 
causal effect of class size on student performance is not as large as the one estimated by Krueger 
for the Project STAR. 
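
A test of this kind can be sketched for a single coefficient as follows. The sketch assumes a large-sample normal approximation with one restriction; the estimate and standard error are hypothetical values chosen for illustration, not entries from Tables 9 and 10, and the function name is ours.

```python
import math

KRUEGER_BENCHMARK = -3.0  # test-score points per additional student (from the text)


def wald_test(estimate, std_error, null_value):
    """Wald statistic and p-value for H0: coefficient == null_value."""
    wald = ((estimate - null_value) / std_error) ** 2
    # For a chi-squared variable with 1 degree of freedom,
    # P(X > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value


# Hypothetical estimate and standard error, for illustration only.
wald, p_value = wald_test(estimate=0.1, std_error=1.0,
                          null_value=KRUEGER_BENCHMARK)
print(f"Wald = {wald:.2f}, p = {p_value:.4f}")
```

With a standard error of about 1, an estimate near zero lies roughly three standard errors from -3, which is why effects of the Krueger magnitude can be rejected in many of the precisely estimated cases.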

To assess whether even smaller class-size effects can be rejected for specific school systems, 
columns (5) and (6) of Tables 9 and 10 test whether we can reject that a class smaller by one 
student leads to an improvement of student performance by only a single TIMSS test-score point 
(equivalent to 1 percent of an international standard deviation). We can reject even such a small 
impact in three cases at the 1 percent level, and in a total of eight cases at the 10 percent level. In 
many cases, therefore, our identification strategy has considerable power to identify the existence 
of class-size effects. 

In sum, we can split our total of 36 estimates of class-size effects from different school 
systems into four (slightly overlapping) broad categories: First, a group of four cases in which we 
find a statistically significant beneficial effect from smaller classes (France and Iceland in 
mathematics, Greece and Spain in science); second, eight cases where we can reject any sizable 
class-size effect with reasonable confidence (Japan and Singapore in both subjects, plus French 
Belgium, Canada, and Portugal in mathematics and Romania in science); third, another thirteen 
cases where we can reject class-size effects of the order of magnitude reported by Krueger (1999) 



13 While -3 would be the order of magnitude of Krueger’s (1999) estimates in terms of standard deviations of 
the international test score (which has a standard deviation of 100), the standard deviations of the test scores within 
each country vary around 100 (see column (4) of Tables 1 and 2). These within-country standard deviations of test 
scores range from 63.6 (in Portugal in mathematics, which is an outlier at the lower bound) to 108.0 (in Korea in 
mathematics). On average across the countries in our sample, the within-country standard deviation is slightly less 
than 100. To estimate the magnitude of the class-size effects in terms of the standard deviation of test scores within 
each country, we also did the Wald tests in terms of -0.03 of a within-country standard deviation. This did not 
introduce any substantive changes to the results presented in columns (3) and (4) of Tables 9 and 10. Thus, we chose 
to present the tests relative to the same value of -3 in each country in order to maintain direct comparability across 
countries, which is feasible because the test scores have been scaled in the same way for all countries. 





with reasonable confidence (Flemish Belgium, Czech Republic, Korea, Slovenia, and Spain in 
both mathematics and science, plus French Belgium, France, and Portugal in science); 14 and 
fourth, a group of twelve cases where we cannot say any of these things about the class-size 
effect with a reasonable degree of confidence on the basis of our identification strategy (the eight 
cases with extremely imprecise estimates referred to before except for Flemish Belgium, plus 
Greece and Romania in mathematics and Canada, Iceland, and the United States in science). 
These results confirm that the question of whether there are sizable class-size effects in 
educational production is one that has to be answered separately for each school system. In 
Appendix 2, we show that our results on class-size effects are robust against several alternative 
specifications of the estimated relationship and against several peculiarities of the dataset. 

F. Interpretation of the Results 

When interpreting the results, it should be noted that there are many aspects of the level and 
quality of educational resources that may influence student performance, of which class size is 
only one. These other classroom inputs, however, are also likely to be endogenous. Lacking 
suitable instruments for these variables, we were forced to restrict our analysis to the effects of 
class size. To the extent that they are correlated with grade-level average class sizes, any class 
size effects we identify could actually be attributable to these other factors. Therefore, our 
estimates are most precisely interpreted as the effects on student achievement of class size and all 
other resource inputs with which it is associated (cf. Boozer and Rouse 2001). If smaller classes 
are also more likely to receive more of other resources, our results may overstate the effect of 
class size on achievement. 

Another issue to be addressed is our use of level scores as opposed to gain scores as our 
measure of student achievement. Because students in the TIMSS sample were only tested at a 
single point in time, our data do not support the estimation of value-added models of educational 
production. Level formulations of the kind we use instead essentially rely on the similarity in the 
size of students’ classes over the course of their recent careers. To the extent that this assumption 
is violated, our estimated class-size effects will be biased towards zero. Confidence in the 
validity of this assumption for our purposes, however, is increased by the fact that our 

14 Note that the science estimate in Spain belongs to both the first and the third group, as it is estimated 
precisely enough to reject both that it is equal to zero and that it is equal to -3 with reasonable confidence. 
identification strategy is explicitly designed to identify only those variations in class size caused 
by natural differences in student enrollment between adjacent grades in a school, which should 
be relatively constant over time. Moreover, the TIMSS exam was itself designed to test concepts 
in mathematics and science covered during the middle school years, further minimizing the 
potential bias resulting from this form of measurement error in our explanatory variable. In our 
specific case, therefore, the use of level scores seems quite plausible, and may even be superior 
to the use of value-added measures given the latter’s greater unreliability (Kane and Staiger 
2001). 

Finally, in addition to estimating the causal effect of class size on student performance, our 
identification strategy allows us to quantify the extent to which students’ levels of performance 
affect the relative size of the class in which they are taught. The large differences in the estimated 
coefficients on class size between our three different methods of estimation (see Tables 5 to 8) 
suggest that there is substantial sorting of students according to achievement levels in most of the 
school systems we analyze. West and Wößmann (2002) show that the nature (within or between 
schools), direction, and magnitude of the sorting effects in the different school systems can be 
linked to such likely sources of student sorting as student and family mobility, distribution of 
responsibility for the placement of students and classes, academic selectivity of schools, and 
availability of remedial or enrichment teaching, giving additional confidence in the plausibility 
and importance of our identification strategy. 
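
How sorting can drive the three estimators apart may be illustrated with simulated data. The sketch below is not the authors' data or exact specification: the data-generating process, the direction and strength of sorting, and all names are assumptions made for illustration, and the pooled regression is unweighted whereas the WLS estimates in the paper use sampling weights.

```python
import numpy as np

TRUE_BETA = -1.0  # assumed causal effect of one additional student (points)


def demean_by_school(x, school):
    """Subtract school means (the within transformation for fixed effects)."""
    means = np.bincount(school, weights=x) / np.bincount(school)
    return x - means[school]


def ols(y, X):
    return np.linalg.lstsq(X, y, rcond=None)[0]


def two_stage_least_squares(y, X, Z):
    """2SLS: replace X by its projection on the instruments Z, then run OLS."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return ols(y, X_hat)


def simulate(n_schools=5000, seed=0):
    rng = np.random.default_rng(seed)
    school = np.repeat(np.arange(n_schools), 2)  # two adjacent grades per school
    grade = np.tile([0.0, 1.0], n_schools)
    n = school.size
    quality = rng.normal(0.0, 30.0, n_schools)   # unobserved school quality
    # Between-school sorting (assumed): better schools attract more students.
    school_size = 25.0 + 0.1 * quality + rng.normal(0.0, 3.0, n_schools)
    # Exogenous enrollment fluctuations drive grade-average class size.
    grade_avg = school_size[school] + rng.normal(0.0, 2.0, n)
    ability = rng.normal(0.0, 10.0, n)           # unobserved class ability
    # Within-school sorting (assumed): abler classes are made larger.
    actual = grade_avg + 0.2 * ability + rng.normal(0.0, 1.0, n)
    score = (500.0 + TRUE_BETA * actual + quality[school] + 40.0 * grade
             + ability + rng.normal(0.0, 5.0, n))
    return school, grade, grade_avg, actual, score


school, grade, grade_avg, actual, score = simulate()

# Pooled least squares: ignores both kinds of sorting.
X_pooled = np.column_stack([actual, grade, np.ones_like(actual)])
b_ols = ols(score, X_pooled)[0]

# School fixed effects: removes between-school sorting only.
y_w = demean_by_school(score, school)
X_w = np.column_stack([demean_by_school(actual, school),
                       demean_by_school(grade, school)])
b_sfe = ols(y_w, X_w)[0]

# SFE-IV: instrument actual class size with grade-average class size.
Z_w = np.column_stack([demean_by_school(grade_avg, school),
                       demean_by_school(grade, school)])
b_sfe_iv = two_stage_least_squares(y_w, X_w, Z_w)[0]

print(f"pooled: {b_ols:.2f}  SFE: {b_sfe:.2f}  "
      f"SFE-IV: {b_sfe_iv:.2f}  (true: {TRUE_BETA})")
```

In this stylized setting, the gap between the pooled and fixed-effects coefficients reflects between-school sorting, and the gap between the fixed-effects and SFE-IV coefficients reflects within-school sorting, which mirrors the way the differences across the three methods are read in the text.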

VI. Conclusion: Where to Look for Class-Size Effects 

Are there sizable class-size effects in educational production? Our results suggest that the answer 
to this question depends on which school system you are looking at. It is possible to boil down 
the pattern of our 36 class-size estimates to a basic picture for the 18 countries, ignoring 
differences between the two subjects, without doing too much harm to the detailed findings 
presented above. In four countries - Australia, Hong Kong, Scotland, and the United States - our 
identification strategy leads to extremely imprecise estimates that do not allow for any confident 
assertion about class-size effects. In two countries - Greece and Iceland - there seem to be non-trivial 
beneficial effects of reduced class sizes. 15 France is the only country where there seem to 
be noteworthy differences between mathematics and science teaching: While there is a 
statistically significant and sizable class-size effect in mathematics, a class-size effect of 
comparable magnitude can be ruled out in science. The nine school systems for which we can 
rule out large-scale class-size effects in both mathematics and science are the two Belgian 
systems, Canada, the Czech Republic, Korea, Portugal, Romania, Slovenia, and Spain. 16 Finally, 
we can rule out any noteworthy causal effect of class size on student performance in two 
countries, Japan and Singapore. 

In short, class-size effects estimated in one school system cannot be interpreted as a general 
finding for all school systems. In the majority of countries in our sample (11 out of 18), we can 
be quite confident that the effect of class size on student performance is not as large as the one 
Krueger (1999) found for the Project STAR. Given that in Krueger’s (1999) own analysis of 
class-size reductions, the benefits only marginally outweigh the costs, this raises considerable 
doubts about the desirability of class-size reductions as a policy intervention in most of the 
school systems we examine. However, the results for individual countries are much more 
diverse. While at one extreme, Greece and Iceland do seem to show sizable class-size effects, 
there seem to be no class-size effects whatsoever in Japan and Singapore. In these two school 
systems, our estimates resemble Hoxby’s (2000: 1280) “rather precisely estimated zeros”. 

The existence of class-size effects in Greece and Iceland, and their total absence in Japan and 
Singapore, raises the question of why class-size effects exist in some school systems, but not in 
others. The answer to this question should indicate to policymakers when class-size reductions 
are most likely to be effective. One might expect the existence of class-size effects to be related 
to such characteristics of a country as its level of development or its overall level of resources. 
However, columns (1), (3), and (7) of Table 11 demonstrate that there is no clear pattern in 
countries’ GDP per capita or average class size that distinguishes countries where substantial 
class-size effects do exist (mainly Greece and Iceland) from those where no class-size effect 

15 This assertion rests on the statistically significant sizable estimates for Greece in science and for Iceland in 
mathematics. The estimates for Greece in mathematics and for Iceland in science are less clear-cut, but cannot rule 
out a sizable effect. Actually, the mathematics estimate in Greece is statistically significant at the 
13 percent level, and the reduced-form estimate is statistically significant at the 8 percent level. 

16 The rejection of a class-size effect of the Krueger magnitude for Canada in science and for Romania in 
mathematics is statistically significant at the 15 percent level only. 



exists (Japan and Singapore), or from the larger group of 9 school systems where large class-size 
effects can be ruled out (“no-large-CSE”). If the main influence were diminishing returns to 
resource inputs, one would expect the countries with notable class-size effects to be those with a 
lower GDP per capita and with larger class sizes. While Greece’s GDP per capita is slightly 
below the mean of the countries where we rule out large class-size effects, Iceland’s is above it; 
and while class sizes in Greece are similar to the mean of the no-large-CSE sample, in Iceland 
they are substantially lower. Thus, the existence of class-size effects does not seem to be driven 
by diminishing returns. 17 

Additionally, the countries with significant class-size effects perform below average in terms 
of overall achievement on the TIMSS tests (column (5) of Table 11), while the countries where 
even small effects are ruled out perform above average. That is, the significant class-size effects 
in Greece and Iceland do not suggest that these are especially “effective” systems. Quite to the 
contrary, they achieve much lower performance levels than Japan and Singapore despite much 
smaller classes. The significant class-size effects in Greece and Iceland simply imply that 
class-size reductions would work to raise student performance within their current institutional 
environments, which as a whole are rather ineffective. 

To understand the existence of class-size effects (and the lack thereof), we have to turn to 
other characteristics of the different school systems. Columns (8) to (11) of Table 11 suggest that 
the overall level of educational spending is relatively low in Greece and Iceland. Columns (8) 
and (9) take data from Lee and Barro (2001) for 1990 (their latest available year), while columns 
(10) and (11) have data from the OECD for 1994. As each of these datasets is available for a 
different sample of countries, we present both. All these indicators suggest that, both in absolute 
terms and relative to the countries’ GDP per capita, educational expenditures per student in 
Greece and Iceland are substantially below the average of the subset of countries without 
class-size effects. 

Given that class sizes in these countries are equal to (Greece) or below (Iceland) the mean 
class size of the countries without sizable class-size effects, these expenditure data suggest that 
Greece and Iceland spend rather little per employed teacher. This is indeed reflected in the 



17 This confirms previous findings based on standard OLS estimates of class-size coefficients (Hanushek and 
Luque 2002). 
available data on teacher salaries. Columns (12) to (16) present data on teacher salaries in the 
different countries. Lee and Barro’s (2001) teacher-salary data (columns (12) and (13)) are 
available only for primary-school teachers in 1990, while the OECD data (columns (14) to (16)) 
refer to teachers in lower secondary education in 1994. Teacher salaries in Greece and Iceland are 
below the mean of the no-large-CSE countries, both in absolute terms, in terms of salary per 
teaching hour, and relative to the country’s GDP per capita, which might be viewed as a proxy 
for the overall salary level in a country and thus as the opportunity cost of becoming a teacher. 
Conversely, teacher salaries seem to be above average in Japan and Singapore. 

A low average salary level for teachers probably means that a country is drawing its teaching 
population from a relatively low level of the overall capability distribution of all employees in 
this country. If this is the case, the different countries seem to have chosen different points on the 
quantity-quality tradeoff with respect to teachers: Greece and Iceland have relatively many but 
poorly-paid teachers, while Japan and Singapore have relatively few but well-paid teachers. 

The assumption that paying teachers less would lead to a lower average level of capability in 
the teacher population also seems to be borne out by the available data on teacher quality. In 
Greece, the highest level of education reached by the vast majority of teachers is the equivalent 
of a BA without any teacher training (columns (17) to (22) of Table 11), based on the sample of 
teachers of the TIMSS students. In Iceland, about a third of the teacher population does not even 
have a proper degree of secondary education, but only some basic teacher training. In both 
countries, the share of teachers with the equivalent of an MA or Ph.D. is very small, at about 2 to 
3 percent. Meanwhile, in the sample of countries without large class-size effects, more than 60 
percent of the teachers received more education than a BA without additional training, and nearly 
20 percent have an MA degree. Judging solely from teachers’ educational levels, therefore, 
Greece and Iceland appear to have a population of teachers that is less capable on average than 
the population of teachers in the 11 countries where we can reject the existence of large 
class-size effects. 

Thus, the evidence on class-size effects presented in this paper suggests the interpretation that 
capable teachers are able to promote student learning equally well regardless of class size (at 
least within the range of variation that occurs naturally between grades). In other words, they are 
capable enough to teach well in large classes. Less capable teachers, however, while perhaps 
doing reasonably well when faced with smaller classes, do not seem to be up to the job of 
teaching large classes. This interpretation is corroborated by the responses given by teachers 
sampled in TIMSS when asked to what extent their teaching was limited by a high 
student/teacher ratio in their classroom. While 48 percent of teachers in Greece and 42 percent in 
Iceland reported that their teaching was limited “a great deal” by a high student/teacher ratio 
(column (23) of Table 11), the percentage of teachers who gave this response averaged only 22 
percent across those countries with no large class-size effects, and it was similarly low in Japan 
and Singapore. Given that actual class sizes in Greece and Iceland are, on average, smaller than 
those in Japan, Singapore, and the group average of countries without substantial class-size 
effects, this response pattern is suggestive both of differences in the quality of teachers in the two 
groups of countries and of the plausibility of the link between these differences and the existence 
of class-size effects. 

The explanation we propose jointly explains why class-size effects exist in some countries but 
not in others, and why the countries where sizable class-size effects do exist are those with a poor 
overall performance level: Greece and Iceland exhibit class-size effects and poor overall 
performance because they have a population of relatively less capable teachers, while Japan and 
Singapore (and, to a lesser extent, the other countries for which large class-size effects are ruled 
out) exhibit no class-size effects but high overall performance because they have a population of 
relatively capable teachers. An apparent implication of our research, therefore, is that it may be 
better policy to devote the limited resources available for education to employing more capable 
teachers rather than to reducing class sizes - moving more to the quality side of the 
quantity-quality tradeoff in the hiring of teachers. The merits of this admittedly speculative conclusion 
seem a promising topic for future research. 
Table 1: Descriptive Statistics: Sample Size, Student Performance, and Student Background in the Mathematics Sample 

(1)-(3): Absolute numbers. - (4)-(17): Weighted means; standard deviations in parentheses. 
Table 2: Descriptive Statistics: Sample Size, Student Performance, and Student Background in the Science Sample 

(1)-(3): Absolute numbers. - (4)-(17): Weighted means; standard deviations in parentheses. 
Table 3: Descriptive Statistics: Class Size in the Mathematics Sample 

[Table note partly illegible in source: weighted means with standard deviations in parentheses; absolute numbers; coefficients from a regression.] 
Significance levels (based on clustering-robust standard errors): * 1 percent. - † 5 percent. - ‡ 10 percent. 
Table 5: Least-Squares and Fixed-Effects Estimates in Mathematics 

Estimates of the coefficient on class size. Dependent variable: Mathematics test score. Controlling for 
grade level and 12 student- and family-background variables. Clustering-robust standard errors in parentheses. 
Significance levels (based on clustering-robust standard errors): * 1 percent. - † 5 percent. - ‡ 10 percent. 



Table 6: Least-Squares and Fixed-Effects Estimates in Science 

Estimates of the coefficient on class size. Dependent variable: Science test score. Controlling for 
grade level and 12 student- and family-background variables. Clustering-robust standard errors in parentheses. 
Table 7: Class-Size Effects in Mathematics in 18 Countries 

a SFE-IV: School fixed effects and instrumental variables. - See text for details on the method of estimation. 
Significance levels (based on clustering-robust standard errors): * 1 percent. - † 5 percent. - ‡ 10 percent. 



Table 8: Class-Size Effects in Science in 18 Countries 

Estimates of the coefficient on class size (grade-average class size in columns (1) and (3)). 

Dependent variable in column (1): Actual class size. Dependent variable in columns (3) and (5): Science test score. 

Controlling for school fixed effects, grade level, and 12 student- and family-background variables. Clustering-robust standard errors in parentheses. 
Table 9: Tests of the Magnitude of the Class-Size Effect in Mathematics 

Table 10: Tests of the Magnitude of the Class-Size Effect in Science 
Table 11: Country Characteristics and the Existence of Class-Size Effects 

Figure 1: Mathematics Performance by Class-Size Blocs in Singapore 

Figure 2: Class Size and Mathematics Performance in Singapore 

Figure 3: Class Size and Mathematics Performance in Iceland 

Figure 4: The Coefficient on Class Size a 

a Number of cases showing a statistically significant positive (black), a statistically insignificant 
positive (white), a statistically insignificant negative (light gray), and a statistically significant negative 
(dark gray) coefficient, respectively. - WLS: Weighted least squares. - SFE: School fixed effects. - 
SFE-IV: School fixed effects and instrumental variables. - See text for details on the methods of 
estimation. 
Appendix 1: The Sample of Countries 

Originally, 46 countries participated in TIMSS. Argentina, Indonesia, and Italy were unable to 
complete the steps necessary to appear in the database; Mexico chose not to release its results; 
and Bulgaria, the Philippines, and South Africa had insufficient data quality for their background 
data to be included in the international database. Performance and background datasets were 
thus available for 39 countries. 

Data limitations made the implementation of our identification strategy impossible in a 
number of countries. Israel and Kuwait tested only eighth-grade students and no seventh-grade 
students. In Sweden, the seventh grade is in elementary schools, while the eighth grade is in 
secondary schools, so that there is no single school in the sample with both a seventh-grade and 
an eighth-grade class in it. Ninth-grade classes, which were additionally tested in both Sweden 
and Switzerland, could not be used as no information on grade-average class size was available 
for these classes. In England and Hungary, the question on grade-average class sizes was not 
administered in the school-principal background questionnaire. 

In several countries, response rates on the class-size questions in the teacher and the 
school-principal background questionnaires were dismal. For example, data on the actual class 
size from the background questionnaires of the mathematics teachers were missing for 68 percent 
of the sampled students in Austria, 59 percent in Thailand, 53 percent in the Russian Federation, 
and 45 percent in Switzerland. Data on the grade-average class size from the background 
questionnaires of the school principals were missing for 44 percent of the sampled students in 
Norway and for 43 percent in Germany. Thus, the following countries were excluded because 
they had fewer than 50 schools left in either mathematics or science for which the appropriate data were 
available: Austria, Colombia, Cyprus, Denmark, Germany, Iran, Ireland, Latvia, Lithuania, 
Netherlands, New Zealand, Norway, Russian Federation, Slovak Republic, Switzerland, and 
Thailand. 

This left us with our sample of 18 school systems: Australia, Flemish Belgium, French 
Belgium, Canada, Czech Republic, France, Greece, Hong Kong, Iceland, Japan, Korea, Portugal, 
Romania, Scotland, Singapore, Slovenia, Spain, and the United States. 
Appendix 2: Robustness of the Results 

We checked our results for robustness against alternative specifications of the estimation 
equation and against peculiarities in the data. These robustness checks include using the log of 
class size, controlling for teacher characteristics, checking for imputed student- and 
family-background data, and checking for outliers. 

The first alternative specification is to use a different functional form for the 
class-size/performance relationship. While the analysis above used a linear form (as, for example, 
also applied by Angrist and Lavy (1999), among many others), Hoxby (2000) suggests using the 
natural logarithm of class size, consistent with the observation that the proportional impact of a 
one-student reduction in class size is greater the smaller the initial size of the class. Tables A3 
and A4 present the coefficients on the log of class size using each of the identification strategies 
applied above. As is apparent in columns (6) and (7), this adjustment produces only two 
noteworthy changes in our estimates generated using the SFE-IV method: In Korea in 
mathematics, the previously insignificant negative coefficient on class size becomes statistically 
significant at the 10 percent level, as does the positive coefficient on class size for science 
performance in Romania. A version of Figure 4 based on estimates using the log of class size 
would therefore contain an additional statistically significant result on each end of the 
distribution, bringing the total number of statistically significant estimates to five on the negative 
side and two on the positive. Our basic substantive conclusions regarding the magnitude of these 
effects, however, remain the same. 
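
The difference between the two functional forms can be made concrete with a short sketch. Under a log specification with coefficient beta, reducing a class from n to n - 1 students changes the predicted score by beta * ln((n - 1) / n), so the implied effect of one student shrinks as classes grow. The coefficient below is a hypothetical value chosen for illustration, not an estimate from Tables A3 and A4.

```python
import math


def log_form_effect(beta_log, class_size):
    """Predicted score change when a class shrinks by one student (log form)."""
    return beta_log * math.log((class_size - 1) / class_size)


def linear_form_effect(beta_linear):
    """Predicted score change when a class shrinks by one student (linear form)."""
    return -beta_linear


BETA_LOG = -50.0  # hypothetical coefficient on the log of class size

for size in (15, 25, 35):
    print(f"class of {size}: {log_form_effect(BETA_LOG, size):.2f} points")
```

With a negative coefficient, the predicted gain from removing one student is larger in a class of 15 than in a class of 35, whereas the linear form implies the same gain at every class size.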

We also checked whether our results are robust to a specification that includes variables 
controlling for teacher characteristics. These characteristics are the sex, age, years of experience, 
and level of education of the specific mathematics and science teacher in each class in the TIMSS 
sample. Results from the re-estimation of our regressions with teacher controls included are 
presented in Tables A5 and A6. The figures in columns (17) and (18) confirm the lack of any 
substantive changes in our estimates of causal class-size effects produced by the SFE-IV method. 
The estimated coefficients on the vast majority of the teacher variables across countries do not 
reach statistical significance. This suggests that excluding the teacher controls in the initial 
specification seems warranted in order to preserve degrees of freedom. Among the statistically 



significant teacher results, there is no clear pattern in the coefficients on teacher’s sex or age. The 
estimated coefficients on teaching experience are consistently positive, suggesting that, 
controlling for age, teacher’s experience may have a positive impact on student achievement. The 
statistically significant coefficients on the different educational levels of the teacher are mostly 
positive in mathematics, although this pattern is less clear in science. It is important to 
emphasize, however, that any interpretation of these estimated coefficients on teacher 
characteristics needs to take into account that, like other resource inputs in education, they are 
potentially endogenous with respect to student performance (see Section V.F). Lacking good 
instruments for these variables, their inclusion provides only limited additional information about 
causal influences on student achievement. 

The family-background data for which we control contain imputed values in cases where 
values were missing. The procedures used to generate these values are described in Wößmann 
(2000). While imputation allows us to include students for whom some family-background data 
were missing, and thus to use a full dataset for all participants in the test, the imputed values are 
not real data and might introduce uncertainty about the estimated effects. 
We have therefore re-estimated the class-size effects excluding all students with any missing 
value in the family-background data, which comprise the data on the students’ sex and age, the 
data on whether the student was born in the country and is living with both parents, and the data 
on parents’ education and the number of books at home. The results of the re-estimation without 
imputed background data are presented in Tables A7 and A8. Column (1) reports the number of 
students with full original data. The exclusion rate relative to our original samples is highest at 
19 percent in Greece (both in the mathematics and the science sample), and it is less than 1 
percent in Japan and Singapore. As is obvious from columns (2) to (7) of Tables A7 and A8, no 
substantial changes in the results occur. We note that the significance level of the SFE-IV estimate for 
Greece in science drops to 11.5 percent, although the coefficient estimate remains within 0.21 of 
the previous result. In essence, the estimates of class-size effects excluding observations with 
imputed background data remain substantively the same. 
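
The sample restriction described above can be sketched in a few lines of Python. This is a minimal illustration under assumed names, not the procedure actually used for Tables A7 and A8: the record layout, the `imputed` flag dictionary, the variable names, and the function `drop_imputed` are all hypothetical, and the example data are made up.

```python
def drop_imputed(students, background_vars):
    """Keep only students with no imputed family-background value.

    `students` is a list of dicts; each carries an "imputed" dict that maps
    a variable name to True if that value was imputed (hypothetical layout).
    Returns the retained students and the share of students excluded.
    """
    kept = [
        s for s in students
        if not any(s["imputed"].get(var, False) for var in background_vars)
    ]
    exclusion_rate = (len(students) - len(kept)) / len(students)
    return kept, exclusion_rate


# Made-up example: five students, one of whom has an imputed value.
BACKGROUND_VARS = ["sex", "age", "born_in_country", "both_parents",
                   "parents_education", "books_at_home"]
sample = [{"id": i, "imputed": {}} for i in range(4)]
sample.append({"id": 4, "imputed": {"books_at_home": True}})

complete_cases, rate = drop_imputed(sample, BACKGROUND_VARS)
```

The class-size regressions would then be re-run on `complete_cases` only, and `rate` corresponds to the exclusion rate reported relative to the original sample.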

In some countries, outliers of especially large or small classes are present in the dataset. It is 
not clear whether these outliers indeed represent actual large or small classes, or whether there 
are errors in the data. There are reasons for especially large or small classes to exist in reality. In 
small villages, a student cohort might by chance be especially small, which would result in an 
especially small class size. Likewise, chronic illness of teachers might lead to particularly large 
classes in special cases. Very large classes do exist in many countries, and this class-size 
variation might reasonably be used to estimate class-size effects. Nevertheless, it is always 
possible that outlying cases in the dataset are caused by misunderstandings of questionnaire items 
on the part of the teacher or the school principal, by mistakes in writing when filling in the 
questionnaires, or simply by typing errors in the construction of the database. As we cannot tell 
whether an error exists in any particular case, we chose to leave any outlying cases in the 
database for our estimations. However, to check whether any of our results are driven by such 
outliers, we went through the data for each country and subject, excluded any obvious outliers, 
and re-estimated our results. None of the results changed in any substantial way, so that we can 
be confident that our results are not driven by any outliers. In a few instances, the number of 
students in the database who were actually tested in a class was larger than the class size reported 
by the teacher. We replaced the reported class size by the number of tested students in these 
cases, continuing to leave out any outliers. Again, this had no noteworthy impact on our results. 
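
The correction of reported class sizes described above amounts to taking the larger of the two counts for each class. A minimal Python sketch follows, with hypothetical identifiers and made-up numbers (the function `adjust_class_sizes` and the class labels are assumptions for illustration, not part of the TIMSS database):

```python
from collections import Counter


def adjust_class_sizes(reported_sizes, tested_class_ids):
    """Raise a reported class size to the number of tested students
    whenever more students were tested than the teacher reported.

    `reported_sizes` maps class id -> class size reported by the teacher.
    `tested_class_ids` lists the class id of every tested student.
    """
    tested_counts = Counter(tested_class_ids)  # missing ids count as 0
    return {
        class_id: max(reported, tested_counts[class_id])
        for class_id, reported in reported_sizes.items()
    }


# Made-up example: class_b has 22 tested students but a reported size of 18.
reported = {"class_a": 25, "class_b": 18}
tested = ["class_a"] * 20 + ["class_b"] * 22
adjusted = adjust_class_sizes(reported, tested)
```

Classes whose reported size is at least the tested count are left unchanged, so the rule only affects the few inconsistent cases.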
Table A1: Comparison of Sample of Included Students to Full Sample in Mathematics 

(1), (2), and (4): Absolute numbers. - (5)-(18): Weighted means. Full sample in brackets. 
Table A2: Comparison of Sample of Included Students to Full Sample in Science 

(1), (2), and (4): Absolute numbers. - (5)-(18): Weighted means. Full sample in brackets. 
Table A3: Class-Size Effects in Mathematics Using the Log of Class Size 

Estimates of the coefficient on log class size. Dependent variable: Mathematics test score. Controlling for 
grade level and 12 student- and family-background variables. Clustering-robust standard errors in parentheses. 
Table A4: Class-Size Effects in Science Using the Log of Class Size 

Estimates of the coefficient on log class size. Dependent variable: Science test score. Controlling for 
grade level and 12 student- and family-background variables. Clustering-robust standard errors in parentheses. 
WLS: Weighted least squares. - SFE: School fixed effects. - SFE-IV: School fixed effects and instrumental variables. - See text for details on the methods of estimation. 
Significance levels (based on clustering-robust standard errors): * 1 percent. — 1 5 percent. — * 10 percent. 



Table A5: Class-Size Effects in Mathematics Controlling for Teacher Characteristics 

Estimates of the coefficient on class size. Dependent variable: Mathematics test score. Controlling for grade level, 
12 student- and family-background variables, and the 6 teacher-background variables mentioned below. 
Table A6: Class-Size Effects in Science Controlling for Teacher Characteristics 

Estimates of the coefficient on class size. Dependent variable: Science test score. Controlling for grade level, 
12 student- and family-background variables, and the 6 teacher-background variables mentioned below. 
Table A7: Class-Size Effects in Mathematics Excluding Observations with Imputed Background Data 

Estimates of the coefficient on class size. Dependent variable: Mathematics test score. Controlling for 
grade level and 12 student- and family-background variables. Clustering-robust standard errors in parentheses. 
Table A8: Class-Size Effects in Science Excluding Observations with Imputed Background Data 

Estimates of the coefficient on class size. Dependent variable: Science test score. Controlling for 
grade level and 12 student- and family-background variables. Clustering-robust standard errors in parentheses. 